In this exercise, we will require the tidyverse, knitr, janitor and gt packages.

library(tidyverse)
library(knitr)
library(janitor)
library(gt)

The goal of this exercise is to try some of the data manipulation options available in R. The data we are looking at today are songs downloaded from Spotify via the spotifyr package. You can even use this package to download your own playlists!

(a) Read in and explore the data set.

The file is called spotify_songs.csv and you can read the file in using the read_csv function. Choose songs as the name of the data frame if you want to be consistent with the rest of the exercise and the solutions we provide. Explore the variables in RStudio or using code. A detailed summary of each of the variables can be found here: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md.

songs <- read_csv("spotify_songs.csv")
Rows: 32833 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(songs)
Rows: 32,833
Columns: 23
$ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
$ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
$ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
$ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
$ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
$ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
$ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
$ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
$ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
$ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
$ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance…
$ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
$ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
$ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
$ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
$ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
$ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
$ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
$ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
$ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
$ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
$ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
$ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
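
The column specification message printed by read_csv() can be silenced in the way the message itself suggests. The following is a minimal sketch that uses a small temporary CSV file (the file and its contents are made up for illustration) rather than spotify_songs.csv:

```r
library(readr)

# Hypothetical three-row file, written to a temporary location for illustration
tmp <- tempfile(fileext = ".csv")
write_csv(
  data.frame(track_name = c("a", "b", "c"), danceability = c(0.7, 0.5, 0.9)),
  tmp
)

# show_col_types = FALSE suppresses the column specification message
toy <- read_csv(tmp, show_col_types = FALSE)

# spec() still retrieves the guessed column types when you want to check them
spec(toy)
```

On the real data, the same argument would be added to the read_csv("spotify_songs.csv") call.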

(b) Remove the playlist ID variable (playlist_id) from the data set.

Use the select() function to remove the variable. This requires you to use a negative sign.

Give this new data frame a new name.

songs2 <- songs %>%
  select(-playlist_id)
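
To see what the negative sign does, here is a small sketch on made-up data (the toy data frame is hypothetical and only stands in for songs). It also shows that the original data frame is left unchanged, because the result is assigned to a new name:

```r
library(dplyr)

# Made-up data standing in for songs
toy <- tibble(
  track_name = c("a", "b"),
  playlist_id = c("x1", "x2"),
  danceability = c(0.7, 0.5)
)

# The minus sign drops playlist_id and keeps every other column
toy2 <- toy %>%
  select(-playlist_id)

names(toy2)  # playlist_id is no longer listed
names(toy)   # the original data frame still has all three columns
```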

(c) Rename the mode variable to be more explanatory.

Use the rename() function to rename the variable mode so that its name makes clear it is the mode of the key.

Create a new code chunk that includes the code from (b) as well, so that it can all be run together as one chunk.

songs2 <- songs %>%
  select(-playlist_id) %>%
  rename(key_mode = mode)

(d) Change the scale of the danceability variable to be out of 100.

The danceability variable is currently a number from 0 to 1. Use the mutate() function to create a new percentage danceability from 0 to 100.

Include all the code in one chunk again, as you continue building on it.

songs2 <- songs %>%
  select(-playlist_id) %>%
  rename(key_mode = mode) %>%
  mutate(dance100 = danceability*100)

(e) Order the data from highest danceability to lowest danceability.

Use the arrange() function to order the data frame by your new danceability scale. Recall that desc() within arrange() sorts in descending order.

What are some of the top songs for danceability?

As above, add the existing code to this chunk as well.

songs2 <- songs %>%
  select(-playlist_id) %>%
  rename(key_mode = mode) %>%
  mutate(dance100 = danceability*100) %>%
  arrange(desc(dance100))
songs2
# A tibble: 32,833 × 23
   track_id      track…¹ track…² track…³ track…⁴ track…⁵ track…⁶ playl…⁷ playl…⁸
   <chr>         <chr>   <chr>     <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
 1 0U7sbXtHiRvv… If Onl… Fusion…      41 0QOi08… If Onl… 2016-0… House/… edm    
 2 4uHMfLdd3IbY… Mega R… DJ Zsu…      27 2n9oVY… Mega R… 2018-0… Electr… edm    
 3 3XVozq1aeqsJ… Ice Ic… Vanill…      70 20O6lf… Vanill… 2008-1… 90s Da… pop    
 4 4a5nDDqiQX6a… Enseña… DJ Goo…      65 2RMlIj… Enseña… 2019-0… Verano… latin  
 5 3Xv5C02Wxlek… Cha Ch… DJ Cas…      54 3Ogg26… Cha Ch… 2004-0… School… latin  
 6 4dA9s3ai98TH… Slow D… India.…      27 5gnsCH… Voyage… 2002-0… Neo So… r&b    
 7 01IQ4aQgOf0K… Funky … Dave         72 3CFVTs… Funky … 2018-1… Rap Wo… rap    
 8 5vOLEEbDyprZ… Get Do… Centra…       5 13MEhm… Underw… 2004-0… Chican… latin  
 9 1GeNui6m825V… Bad Ba… Young …      81 1bnHPO… So Muc… 2019-0… Hip-Ho… rap    
10 4TJ56OkWrnf2… In da … Trick …      48 4uHDWJ… Thug H… 2002-0… Southe… rap    
# … with 32,823 more rows, 14 more variables: playlist_subgenre <chr>,
#   danceability <dbl>, energy <dbl>, key <dbl>, loudness <dbl>,
#   key_mode <dbl>, speechiness <dbl>, acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>, dance100 <dbl>, and abbreviated variable names
#   ¹​track_name, ²​track_artist, ³​track_popularity, ⁴​track_album_id,
#   ⁵​track_album_name, ⁶​track_album_release_date, ⁷​playlist_name, …
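
Because the printed tibble is wide, the danceability column is pushed out of view. One way to answer the question about the top songs is to keep only a few columns and the first few rows. This is a sketch on made-up data (toy is hypothetical); on the real data you would start from songs2 instead:

```r
library(dplyr)

# Made-up data standing in for songs2
toy <- tibble(
  track_name = c("a", "b", "c", "d"),
  track_artist = c("w", "x", "y", "z"),
  dance100 = c(70, 98, 55, 91)
)

top_dance <- toy %>%
  arrange(desc(dance100)) %>%                    # highest danceability first
  select(track_name, track_artist, dance100) %>% # keep only the columns of interest
  slice_head(n = 3)                              # first three rows after sorting
top_dance
```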

(f) Select only those songs with high danceability.

Use the filter() function to select songs with danceability greater than or equal to 90%.

How many songs are in this category? Use tabyl() to see how many highly danceable songs are in each genre. Be careful not to overwrite the original data set with this table. There is no need to assign this tabyl to an object unless you intend to refer to it later.

It would be useful to know how the proportions of each genre compare to the entire set of songs as well. So instead of filtering by high danceability, try creating a binary variable that indicates high danceability (90 or more) compared with the alternative, using the mutate() function. You may like to save this version of the data set with a new name.

songs %>%
  select(-playlist_id) %>%
  rename(key_mode = mode) %>%
  mutate(dance100 = danceability*100) %>%
  arrange(desc(dance100)) %>%
  filter(dance100>=90) %>%
  tabyl(playlist_genre) %>%
  adorn_pct_formatting() %>%
  gt()
playlist_genre     n   percent
edm               93     12.5%
latin             99     13.3%
pop               47      6.3%
r&b              120     16.1%
rap              368     49.5%
rock              17      2.3%

There seem to be a large number of rap songs with high danceability.

songs2 <- songs %>%
  select(-playlist_id) %>%
  rename(key_mode = mode) %>%
  mutate(dance100 = danceability*100) %>%
  arrange(desc(dance100)) %>%
  mutate(highdance = dance100>=90) %>%
  mutate(highdance = replace(highdance, highdance == FALSE, "low")) %>%
  mutate(highdance = replace(highdance, highdance == TRUE, "high"))

songs2 %>%
  tabyl(playlist_genre, highdance) %>%
  gt()
playlist_genre   high    low
edm                93   5950
latin              99   5056
pop                47   5460
r&b               120   5311
rap               368   5378
rock               17   4934

Use the replace() function within mutate() to give this new variable more informative level names (rather than the default TRUE or FALSE). From this table we can see that not only are there a large number of rap songs with high danceability; the highest percentage of high danceability songs is also found in the rap genre (368 of 5,746 rap songs, or about 6.4%).
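
The two replace() calls in the solution work because assigning the string "low" into a logical vector converts the remaining TRUE values to the string "TRUE", which the second replace() then matches. As an alternative sketch (not the approach used in the solution above), dplyr's if_else() assigns both labels in one step. The toy data frame is made up for illustration:

```r
library(dplyr)

# Made-up danceability scores, including values on either side of the cut-off
toy <- tibble(dance100 = c(95, 40, 90, 89.9))

# Label each row directly, without passing through TRUE/FALSE
toy <- toy %>%
  mutate(highdance = if_else(dance100 >= 90, "high", "low"))
toy
```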

Note also that we separate the data manipulation step from the tabyl step, so that the updated data (and not the table) are saved with the name songs2, which we can refer to below.
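
The two-way table shows counts only, so comparing the share of highly danceable songs across genres requires row percentages. One possible way to get these is with janitor's adorn_ functions. This is a sketch on made-up data (toy is hypothetical); on the real data the pipeline would start from songs2:

```r
library(dplyr)
library(janitor)

# Made-up data: four rap songs and four pop songs
toy <- tibble(
  playlist_genre = c("rap", "rap", "rap", "rap", "pop", "pop", "pop", "pop"),
  highdance = c("high", "high", "low", "low", "high", "low", "low", "low")
)

# Row proportions: each genre's row sums to 1
genre_prop <- toy %>%
  tabyl(playlist_genre, highdance) %>%
  adorn_percentages("row")

# Format as percentages and append the underlying counts in brackets
genre_pct <- genre_prop %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
genre_pct
```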

(g) Produce some plots to summarise the relationships between some variables.

Consider the relationship between popularity (track_popularity) and danceability for the full data set. What visual display could you use to explore this relationship?

Hint: there are a large number of data points, so there will be a lot of overlap in the points. There are a few techniques that can help. For example, you can add additional information with a geom_smooth() curve. You can also summarise the points in each location with something like geom_hex() (which requires the hexbin package), or make the points transparent (geom_point(alpha = 0.1)) or very small (geom_point(shape = ".")). Play around with these options to see what you think works best. Don’t be surprised if these figures take some time to generate, as each one draws a large number of data points.

How might you explore this relationship for different genres (playlist_genre)?

ggplot(songs2,
       aes(x = dance100,
           y = track_popularity)) +
  geom_point()

ggplot(songs2,
       aes(x = dance100,
           y = track_popularity)) +
  geom_point() + geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(songs2,
       aes(x = dance100,
           y = track_popularity)) +
  geom_point(alpha = 0.1) 

ggplot(songs2,
       aes(x = dance100,
           y = track_popularity)) +
  geom_point(shape = ".") 

library(hexbin)
ggplot(songs2,
       aes(x = dance100,
           y = track_popularity)) +
  geom_hex() 

The plots above show a range of options. For the genre figure, let’s add colour and smoothed lines by genre.

ggplot(songs2,
       aes(x = dance100,
           y = track_popularity,
           colour = playlist_genre)) +
  geom_point() + geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

It is still a bit hard to see what is happening with all the genres on the same plot, so we will also try panelling by genre and add some transparency as well.

ggplot(songs2,
       aes(x = dance100,
           y = track_popularity)) +
  geom_point(alpha = 0.1) + geom_smooth() +
  facet_wrap(vars(playlist_genre), nrow=3)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

In general, there is no evidence of a strong relationship between the danceability and popularity of a track, though there are some slight positive trends (for example, for rap).
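
A numerical summary can complement the plots. The sketch below computes the Pearson correlation between danceability and popularity within each genre. It uses made-up data (toy is hypothetical, and its values say nothing about the real songs); on the real data you would start from songs2. Note that correlation only measures linear association, so it does not replace looking at the smoothed curves:

```r
library(dplyr)

# Made-up data with two genres of four songs each
toy <- tibble(
  playlist_genre = rep(c("rap", "rock"), each = 4),
  dance100 = c(60, 70, 80, 90, 40, 50, 60, 70),
  track_popularity = c(30, 45, 50, 70, 55, 40, 60, 45)
)

# One row per genre: number of songs and correlation within that genre
genre_cor <- toy %>%
  group_by(playlist_genre) %>%
  summarise(
    n = n(),
    correlation = cor(dance100, track_popularity)
  )
genre_cor
```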

(h) Extension exercises.

Consider some other relationships between variables that you are interested in and create the code to explore these, with summary tables and/or visual displays.

Download the airport screening file used in lectures and perform some of your own data transformations and summaries of these variables.


© 2022 Statistical Consulting Centre, The University of Melbourne.